Data

My data is from 49,159 responses to Cattell’s personality test. I obtained it via Kaggle, as uploaded by Bojan Tunguz (https://www.kaggle.com/datasets/tunguz/cattells-16-personality-factors), who obtained the data from the Open-Source Psychometrics Project (https://openpsychometrics.org/_rawdata/). The original purpose of the data was to analyze individuals’ personalities based on a model of 16 factors. I will be using the data to see which personality factors best predict gender, and to create a predictive model based on the questions.

Variables

My independent variables of interest are the 163 questions asked of the participants. They are discrete: answers to a prompt on a scale of 1-5, with 5 being highly agree. Nonresponse was coded as 0.

My dependent variable is gender. It is coded as 1 for male and 2 for female, with 0 and 3 indicating NA or other.

I will use factor analysis to create new factors that will become the ultimate independent variables.

Question

What factors can be determined from the answers to the 163 questions? Will my analysis produce the same 16 that Cattell’s model is based on?

What factors best predict gender? Do personality scores differ enough across gender to predict accurately?

Set up & initial wrangling

# look at dependent variable 
table(personality$gender)
## 
##     0     1     2     3 
##   238 19458 29183   280
# keep male and female
personality_clean <- personality %>%
  filter(gender == 1 | gender == 2) %>%
  mutate(gender = ifelse(gender == 2, 0, gender))

# remove all observations with any missing data (response of 0)
personality_clean <- personality_clean %>%
  filter_at(c(1:163), all_vars(. != 0))

# remove other variables
personality_questions <- na.omit(personality_clean[,-c(164:169)])

Dimension Reduction

Check for multicollinearity

# get correlations
correlations <- cor(personality_questions)

# get correlations >= 0.90
greater_than <- which(correlations >= 0.90, arr.ind = TRUE)
greater_than <- greater_than[
  greater_than[,"row"] < greater_than[,"col"],
]

No pairwise correlation reaches 0.90, so no correlations are concerning.

Perform initial PCA

# scale variables
scaled_personality <- scale(personality_questions)
# perform PCA (the input is already scaled, so center/scale. here are redundant but harmless)
personality_pca <- prcomp(scaled_personality, center = TRUE, scale. = TRUE)

The first 19 components explained just over 50% of the variance, so I expect to reduce to around that number.
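The cumulative proportion of variance explained can be read off a `prcomp` fit via `summary()`. A minimal sketch on simulated data (standing in for the scaled question matrix; the variable names here are illustrative):

```r
set.seed(42)
# simulated stand-in for the scaled question matrix
X <- matrix(rnorm(1000 * 10), ncol = 10)
pca <- prcomp(X)
# cumulative proportion of variance explained by each component
cum_var <- summary(pca)$importance["Cumulative Proportion", ]
# smallest number of components explaining at least 50% of the variance
n_components <- which(cum_var >= 0.50)[1]
```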

Visualize the PCA

fviz_pca_var(
  personality_pca,
  col.var = "contrib",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE 
)

fviz_pca_ind(
  personality_pca,
  geom = "point",
  col.ind = "cos2", 
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = FALSE
)

fviz_pca_biplot(
  personality_pca, repel = TRUE,
  col.var = "#FC4E07", 
  col.ind = "#00AFBB", 
  label = "var"
) 

The variables and individuals look fairly evenly distributed across the dimensions.

Confirm PCA is appropriate

cortest.bartlett(scaled_personality)
## R was not square, finding R from data
## $chisq
## [1] 2559743
## 
## $p.value
## [1] 0
## 
## $df
## [1] 13203
KMO(scaled_personality)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = scaled_personality)
## Overall MSA =  0.97
## MSA for each item = 
##   A1   A2   A3   A4   A5   A6   A7   A8   A9  A10   B1   B2   B3   B4   B5   B6 
## 0.97 0.99 0.96 0.96 0.98 0.98 0.98 0.96 0.97 0.97 0.97 0.95 0.96 0.97 0.96 0.97 
##   B7   B8   B9  B10  B11  B12  B13   C1   C2   C3   C4   C5   C6   C7   C8   C9 
## 0.94 0.95 0.93 0.98 0.95 0.95 0.95 0.97 0.97 0.98 0.98 0.97 0.98 0.97 0.97 0.98 
##  C10   D1   D2   D3   D4   D5   D6   D7   D8   D9  D10   E1   E2   E3   E4   E5 
## 0.98 0.95 0.95 0.97 0.97 0.95 0.98 0.97 0.98 0.96 0.98 0.98 0.97 0.92 0.97 0.96 
##   E6   E7   E8   E9  E10   F1   F2   F3   F4   F5   F6   F7   F8   F9  F10   G1 
## 0.96 0.91 0.97 0.96 0.93 0.95 0.95 0.91 0.96 0.97 0.94 0.95 0.94 0.94 0.96 0.99 
##   G2   G3   G4   G5   G6   G7   G8   G9  G10   H1   H2   H3   H4   H5   H6   H7 
## 0.98 0.98 0.98 0.98 0.99 0.99 0.98 0.98 0.99 0.84 0.96 0.84 0.91 0.94 0.94 0.94 
##   H8   H9  H10   I1   I2   I3   I4   I5   I6   I7   I8   I9  I10   J1   J2   J3 
## 0.94 0.95 0.95 0.96 0.97 0.98 0.97 0.98 0.97 0.96 0.96 0.97 0.96 0.97 0.95 0.96 
##   J4   J5   J6   J7   J8   J9  J10   K1   K2   K3   K4   K5   K6   K7   K8   K9 
## 0.92 0.97 0.98 0.95 0.95 0.84 0.85 0.97 0.99 0.98 0.98 0.98 0.96 0.96 0.97 0.97 
##  K10   L1   L2   L3   L4   L5   L6   L7   L8   L9  L10   M1   M2   M3   M4   M5 
## 0.98 0.98 0.99 0.98 0.98 0.97 0.97 0.98 0.96 0.98 0.98 0.93 0.97 0.96 0.97 0.96 
##   M6   M7   M8   M9  M10   N1   N2   N3   N4   N5   N6   N7   N8   N9  N10   O1 
## 0.95 0.97 0.95 0.96 0.97 0.98 0.98 0.95 0.97 0.96 0.95 0.97 0.98 0.98 0.97 0.93 
##   O2   O3   O4   O5   O6   O7   O8   O9  O10   P1   P2   P3   P4   P5   P6   P7 
## 0.95 0.94 0.89 0.95 0.91 0.92 0.92 0.93 0.96 0.96 0.96 0.93 0.98 0.99 0.98 0.89 
##   P8   P9  P10 
## 0.96 0.94 0.98

PCA is appropriate for the data. In Bartlett’s test, p < .05, so the correlation matrix differs significantly from the identity. The overall MSA is .97, which is “marvelous.” No individual MSA is below .8, which is “meritorious.”

Find ideal number of factors

# initial pca
initial_pca <- principal(scaled_personality, nfactors = ncol(scaled_personality), rotate ="oblimin")
## Loading required namespace: GPArotation
plot(initial_pca$values, type = "b", ylab = "Eigenvalues"); abline(h = 1)

# parallel pca
parallel_pca <- fa.parallel(
  x = scaled_personality, fa = "pc",
  sim = FALSE 
)

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  22

The scree plot and parallel analysis agree that 22 components is optimal.

Perform final PCA

final_pca <- principal(
  r = scaled_personality,
  nfactors = 22, # ideal 22 factors
  rotate = "oblimin", 
  residuals = TRUE
)

# check for normal residuals
# get residuals
resid <- final_pca$residual
lower_resid <- resid[lower.tri(resid)]
# perform shapiro test
shapiro.test(sample(final_pca$residual, 5000))
## 
##  Shapiro-Wilk normality test
## 
## data:  sample(final_pca$residual, 5000)
## W = 0.37146, p-value < 2.2e-16
# plot
hist(lower_resid)

The Shapiro test rejected the null hypothesis that the residuals are normal, but with a sample of 5,000 it flags even trivial departures; the histogram shows they are close enough to normal not to be concerned.

See what questions were included in each factor

# get loadings
loadings <- round(final_pca$loadings[,], 3)
# blank out loadings with absolute value below .30
loadings[abs(loadings) < 0.30] <- "" 
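Mechanically, the component-to-question mapping can be read off by listing, for each column of the loadings matrix, the items whose absolute loading clears the .30 cutoff. A sketch on a hypothetical 3×3 loadings matrix (toy values, not the actual loadings):

```r
# toy loadings matrix standing in for final_pca$loadings (hypothetical values)
toy_loadings <- matrix(
  c(0.62, -0.41,  0.05,
    0.10,  0.55, -0.35,
    0.02,  0.08,  0.71),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Q1", "Q2", "Q3"), c("C1", "C2", "C3"))
)
# for each component, list the items whose absolute loading clears the cutoff
items_per_component <- apply(
  toy_loadings, 2,
  function(col) names(col)[abs(col) >= 0.30]
)
```

Because components pick up different numbers of items, the result is a named list, one entry per component.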

Looking at the loadings, I found these questions for each component and added a label:

Component 1: Social

Component 2: Simple Minded

Component 3: Depressed

Component 4: Empathetic

Component 5: Scattered

Component 6: Unique

Component 7: Open

Component 8: Judgmental

Component 9: Introverted

Component 10: Intellectual

Component 11: Trusting

Component 12: Bold

Component 13: Defiant

Component 14: Insecure

Component 15: Moody

Component 16: Funny

Component 17: Serious

Component 18: Analytical

Component 19: Quiet

Component 20: Sentimental

Component 21: Traditional

Component 22: Present

# combine the scores from PCA with the gender data
personality_scores <- as.data.frame(final_pca$scores)
personality_traits<- cbind(personality_clean$gender, personality_scores) 

# rename components with descriptive labels
colnames(personality_traits) <- c(
  "Gender", "Social", "Defiant", "Open", "Trusting", "Depressed", "Bold",
  "Simple-minded", "Empathetic", "Scattered", "Intellectual", "Analytical",
  "Funny", "Serious", "Unique", "Judgemental", "Sentimental", "Present",
  "Quiet", "Insecure", "Traditional", "Moody", "Introverted"
)

Visualize variables of interest

# box plots of value vs gender
personality_traits %>%
  # change data to long format & subset first half
  pivot_longer(c(2:12)) %>%
  ggplot(aes(value, factor(Gender 
                           # label Gender variable as Female and Male
                           ,labels = c("Female", "Male") 
                           ))) +
  geom_boxplot()+
  facet_wrap(~name) + 
  labs(title = "Gender Vs. Normalized Trait Score", 
       y = "Gender", 
       x= "Normalized Trait Score", 
       subtitle = "For Components of Cattell's Personality Test" )

personality_traits %>%
  pivot_longer(c(13:23)) %>% 
  # change data to long format and subset second half
  ggplot(aes(value, factor(Gender
                           # label Gender variable as Female and Male
                           ,labels = c("Female", "Male")))) +
  geom_boxplot()+
  facet_wrap(~name)+
    labs(title = "Gender Vs. Normalized Trait Score", 
       y = "Gender", 
       x= "Normalized Trait Score", 
       subtitle = "For Components of Cattell's Personality Test" )

# ggpairs 
# subset data to make plot readable
traits_balanced_1 <-personality_traits[,c(1:8)]
traits_balanced_2 <-personality_traits[,c(1,9:15)]
traits_balanced_3 <-personality_traits[,c(1,16:23)]
# plot
ggpairs(data = traits_balanced_1)

ggpairs(data = traits_balanced_2)

ggpairs(data = traits_balanced_3)

From this visualization it looks as though the bold, funny, quiet, and intellectual traits are stronger in females, while the defiant and insecure traits are stronger in men.

The ggpairs plots show a roughly normal distribution for each trait. Social, trusting, simple-minded, and unique were the only variables without a significant relationship to gender. This agrees with the box plots, where none of these traits had noticeably different distributions in men and women.

Model Data

Because the outcome variable is binary (men or women), I will use logistic regression.
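As a minimal illustration of the model family, here is a logistic regression on a built-in binary outcome (the mtcars transmission variable, not the personality data):

```r
# logistic regression of a binary outcome on one predictor, built-in data
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")
# type = "response" maps the linear predictor through the inverse logit,
# giving predicted probabilities in (0, 1)
probs <- predict(fit, type = "response")
```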

Check model assumptions

Multicollinearity

new_cor <- cor(personality_traits[,-1])

# get correlations >= 0.4
greater_than_new <- which(new_cor >= 0.4, arr.ind = TRUE)
greater_than_new <- greater_than_new[
  greater_than_new[,"row"] < greater_than_new[,"col"],
]

No pair of factors has a correlation above 0.4, so multicollinearity should not be a problem.

# check balance in outcome
table(personality_traits$Gender)
## 
##     0     1 
## 21171 13856
# rebalance
traits_balanced <- personality_traits[
  c(
  which(personality_traits$Gender == 1), #keep all men
  # sample 13856 women
  sample(
    which(personality_traits$Gender == 0),13856)),
]

There were more women than men before balancing.

Split data into training and testing

# use 70% of the observations for training
train_index <- sample(
  1:nrow(traits_balanced),
  round(nrow(traits_balanced) * 0.70)
)

# save other 30% for testing
test_index <- setdiff(
  1:nrow(traits_balanced),
  train_index)

Model Selection

# perform initial logistic regression
trait_lrm<- glm(
  formula = Gender ~.,
  data = traits_balanced[train_index,],
  family = "binomial"
)
# check for multicollinearity 
vif(trait_lrm)
##          Social         Defiant            Open        Trusting       Depressed 
##        1.543823        1.345400        1.358792        1.429380        1.388699 
##            Bold `Simple-minded`      Empathetic       Scattered    Intellectual 
##        1.425512        1.368721        1.395945        1.391314        1.272279 
##      Analytical           Funny         Serious          Unique     Judgemental 
##        1.289693        1.305068        1.223960        1.232676        1.258602 
##     Sentimental         Present           Quiet        Insecure     Traditional 
##        1.215320        1.205212        1.161281        1.186143        1.128292 
##           Moody     Introverted 
##        1.121279        1.169594

Use LASSO regularization

# get matrices of predictors and outcomes
new_predictors <-as.matrix(traits_balanced[train_index,c(2:23)])
test_predictors <-as.matrix(traits_balanced[test_index,c(2:23)])
new_outcome <-as.matrix(traits_balanced[train_index,1])
test_outcome <- as.matrix(traits_balanced[test_index,1])

# perform cross-validation to determine the best lambda
# (cv.glmnet defaults to gaussian loss here; family = "binomial" would match the final model)
cv_lasso_h <- cv.glmnet(
  as.matrix(new_predictors),
  as.matrix(new_outcome),
  alpha = 1
)

# print results
cv_lasso_h
## 
## Call:  cv.glmnet(x = as.matrix(new_predictors), y = as.matrix(new_outcome),      alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##       Lambda Index Measure       SE Nonzero
## min 0.000650    63  0.1699 0.001400      20
## 1se 0.007306    37  0.1711 0.001313      18
plot(cv_lasso_h)

# save best lambda
best_lambda <- cv_lasso_h$lambda.min

The best lambda found was .00065.

Perform LASSO with this lambda

# perform lasso
best_lasso <-  glmnet(
  as.matrix(new_predictors),
  as.matrix(new_outcome),
  family = "binomial",
  alpha=1,
  lambda = best_lambda
)

# obtain coefficients
# lasso coefficients
lasso_coef <- unname(as.matrix(coef(best_lasso)))
# regular logistic coefficients
lrm_coef <- coef(trait_lrm)
# combine into data frame
data.frame(
  lasso = lasso_coef,
  lrm = lrm_coef,
  difference = lasso_coef - lrm_coef
)
##                         lasso          lrm   difference
## (Intercept)     -0.1920954794 -0.195575951  0.003480472
## Social          -0.0566651392 -0.062106457  0.005441318
## Defiant          0.3117777794  0.321857344 -0.010079565
## Open             0.0035417797  0.012444832 -0.008903052
## Trusting         0.1557767189  0.163960797 -0.008184078
## Depressed        0.1354682467  0.145362806 -0.009894559
## Bold            -0.4261599052 -0.434353396  0.008193491
## `Simple-minded`  0.0007151962  0.008270710 -0.007555513
## Empathetic      -0.1801554083 -0.188345882  0.008190474
## Scattered       -0.6547446689 -0.666111223  0.011366554
## Intellectual    -0.3429940673 -0.351001556  0.008007488
## Analytical      -0.2264795008 -0.235051004  0.008571503
## Funny           -0.4690464043 -0.479494189  0.010447784
## Serious          0.0000000000  0.002399858 -0.002399858
## Unique          -0.0502543185 -0.053986364  0.003732045
## Judgemental      0.0494206367  0.054275054 -0.004854418
## Sentimental     -0.0866268511 -0.093985222  0.007358371
## Present          0.3195488689  0.328614091 -0.009065222
## Quiet           -1.1281894876 -1.141668872  0.013479384
## Insecure         0.3637761751  0.370222341 -0.006446166
## Traditional     -0.1128722623 -0.120857974  0.007985712
## Moody           -0.0154188482 -0.021129333  0.005710485
## Introverted      0.3943762702  0.402377628 -0.008001357

Most factors had similar coefficients. LASSO removed serious entirely, consistent with its weak relationship with gender. Quiet changed the most, with a decrease in magnitude of about .013.
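The zeroing-out behavior follows from LASSO’s penalty: in the orthonormal-design case its solution is soft-thresholding, which shrinks every coefficient toward zero by lambda and sets those within lambda of zero exactly to zero. A sketch of the operator with illustrative values:

```r
# LASSO's shrinkage (orthonormal-design case) is soft-thresholding:
# shrink each coefficient toward zero by lambda; small ones become exactly zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
shrunk <- soft_threshold(c(-0.5, 0.02, 0.4), 0.1)
```

The middle coefficient (0.02, within lambda of zero) is removed from the model entirely, just as serious was above.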

Model evaluation & selection

# Get predicted from lasso
lasso_predicted <- predict( 
  best_lasso, newx = new_predictors
)

# Get predicted from regular (fitted values on the training data, link scale;
# note predict.glm takes `newdata`, not `newx`)
lrm_predicted <- predict(trait_lrm)

# R-squared 
cor(lasso_predicted, new_outcome)^2
##         [,1]
## s0 0.3224526
cor(lrm_predicted, new_outcome)^2
##           [,1]
## [1,] 0.3224807
# RMSE
sqrt(mean((lasso_predicted-new_outcome)^2))
## [1] 1.542033
sqrt(mean((lrm_predicted-new_outcome)^2))
## [1] 1.567162

Both models explained about 32% of the variability in the data. The average difference between prediction and actual value was also similar, with the LASSO performing slightly better: an RMSE of 1.54 compared to 1.57.

# Convert prediction to gender factor
# for lasso
gender_lasso <- factor( 
  ifelse(lasso_predicted > 0.50, 1, 0))
# for regular logistic
gender_lrm <- factor( 
  ifelse(lrm_predicted > 0.50, 1, 0))

# Get test classes
# for lasso
test_lasso <- factor( 
  ifelse(
    # make prediction on new data
    predict( best_lasso,
             test_predictors) > 0.50, 1, 0))
# for regular
test_lrm <- factor( 
  ifelse(
    # make prediction on new data
    predict(trait_lrm,
             as.data.frame(test_predictors)) > 0.50, 1, 0))

Confusion matrices
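The confusion matrices below come from caret; at their core they are cross-tabulations of predicted vs. actual class, from which accuracy, sensitivity, and specificity follow directly. A toy base-R sketch (illustrative vectors, not the model output):

```r
# toy predicted and actual classes
pred   <- factor(c(1, 0, 1, 1, 0), levels = c(0, 1))
actual <- factor(c(1, 0, 0, 1, 1), levels = c(0, 1))
cm <- table(Prediction = pred, Reference = actual)
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # true positives / actual positives
specificity <- cm["0", "0"] / sum(cm[, "0"])   # true negatives / actual negatives
```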

# training data
# lasso
confusionMatrix(data = gender_lasso, positive = "1",
                reference = factor(new_outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8387 3594
##          1 1315 6102
##                                          
##                Accuracy : 0.7469         
##                  95% CI : (0.7408, 0.753)
##     No Information Rate : 0.5002         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4938         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.6293         
##             Specificity : 0.8645         
##          Pos Pred Value : 0.8227         
##          Neg Pred Value : 0.7000         
##              Prevalence : 0.4998         
##          Detection Rate : 0.3146         
##    Detection Prevalence : 0.3824         
##       Balanced Accuracy : 0.7469         
##                                          
##        'Positive' Class : 1              
## 
# regular logistic
confusionMatrix(data = gender_lrm, positive = "1",
                reference = factor(new_outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8370 3580
##          1 1332 6116
##                                           
##                Accuracy : 0.7468          
##                  95% CI : (0.7406, 0.7529)
##     No Information Rate : 0.5002          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4935          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6308          
##             Specificity : 0.8627          
##          Pos Pred Value : 0.8212          
##          Neg Pred Value : 0.7004          
##              Prevalence : 0.4998          
##          Detection Rate : 0.3153          
##    Detection Prevalence : 0.3840          
##       Balanced Accuracy : 0.7467          
##                                           
##        'Positive' Class : 1               
## 
# testing data
# lasso
confusionMatrix(data = test_lasso, positive = "1",
                reference = factor(test_outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3603 1510
##          1  551 2650
##                                           
##                Accuracy : 0.7521          
##                  95% CI : (0.7427, 0.7614)
##     No Information Rate : 0.5004          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5043          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6370          
##             Specificity : 0.8674          
##          Pos Pred Value : 0.8279          
##          Neg Pred Value : 0.7047          
##              Prevalence : 0.5004          
##          Detection Rate : 0.3187          
##    Detection Prevalence : 0.3850          
##       Balanced Accuracy : 0.7522          
##                                           
##        'Positive' Class : 1               
## 
# regular logistic
confusionMatrix(data = test_lrm, positive = "1",
                reference = factor(test_outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3601 1498
##          1  553 2662
##                                           
##                Accuracy : 0.7533          
##                  95% CI : (0.7439, 0.7625)
##     No Information Rate : 0.5004          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5067          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6399          
##             Specificity : 0.8669          
##          Pos Pred Value : 0.8280          
##          Neg Pred Value : 0.7062          
##              Prevalence : 0.5004          
##          Detection Rate : 0.3202          
##    Detection Prevalence : 0.3867          
##       Balanced Accuracy : 0.7534          
##                                           
##        'Positive' Class : 1               
## 

The LASSO model has a slightly higher Kappa on the training data, .4938 compared to the regular logistic model’s .4935. The overall accuracy, sensitivity, and specificity were about the same in both models, at around .75, .63, and .86. As the higher specificity shows, both models had more success predicting women (coded 0) than men (coded 1).

On the testing data, the Kappa values were higher still, at .5067 (regular logistic) and .5043 (LASSO). Accuracy was also slightly higher, around .75 for both, compared to .747 on training. Sensitivity rose slightly (.64), and specificity remained the highest metric (.87).

Overall there was not a large difference between the models, with the regular logistic model showing a slightly higher Kappa on the testing data.

Try another model

Since the model retaining non-significant predictors was slightly better than the LASSO, I will try removing those predictors manually.

# keep only significant predictors
traits_signif <- traits_balanced[train_index,-c(2,4,8,14,21,22)]

# run logistic regression
trait_lrm_2<- glm(
  formula = Gender ~.,
  data = traits_signif,
  family = "binomial"
)

# Get predicted gender from the reduced model
# (fitted values on the training data; predict.glm takes `newdata`, not `newx`)
lrm_predicted_2 <- predict(trait_lrm_2)

gender_lrm_2 <- factor( 
  ifelse(lrm_predicted_2 > 0.50, 1, 0))

# Get test classes
test_lrm_2 <- factor( 
  ifelse(
    # make prediction on new data
    predict(trait_lrm_2,
             as.data.frame(test_predictors)) > 0.50, 1, 0))

# R-squared 
cor(lrm_predicted_2, new_outcome)^2
##           [,1]
## [1,] 0.3224807
# RMSE
sqrt(mean((lrm_predicted_2-new_outcome)^2))
## [1] 1.567162
# confusion matrices
# training data
confusionMatrix(data = gender_lrm_2, positive = "1",
                reference = factor(new_outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 8370 3580
##          1 1332 6116
##                                           
##                Accuracy : 0.7468          
##                  95% CI : (0.7406, 0.7529)
##     No Information Rate : 0.5002          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4935          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6308          
##             Specificity : 0.8627          
##          Pos Pred Value : 0.8212          
##          Neg Pred Value : 0.7004          
##              Prevalence : 0.4998          
##          Detection Rate : 0.3153          
##    Detection Prevalence : 0.3840          
##       Balanced Accuracy : 0.7467          
##                                           
##        'Positive' Class : 1               
## 
# testing data
confusionMatrix(data = test_lrm_2, positive = "1",
                reference = factor(test_outcome))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3603 1515
##          1  551 2645
##                                           
##                Accuracy : 0.7515          
##                  95% CI : (0.7421, 0.7608)
##     No Information Rate : 0.5004          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5031          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6358          
##             Specificity : 0.8674          
##          Pos Pred Value : 0.8276          
##          Neg Pred Value : 0.7040          
##              Prevalence : 0.5004          
##          Detection Rate : 0.3181          
##    Detection Prevalence : 0.3844          
##       Balanced Accuracy : 0.7516          
##                                           
##        'Positive' Class : 1               
## 
# check AIC of models
AIC(trait_lrm_2)
## [1] 19372.91
AIC(trait_lrm)
## [1] 19338.98

This model still explains about 32% of the variability in the data, with an RMSE of 1.57. The Kappa value, specificity, sensitivity, and accuracy are essentially unchanged, as removing the insignificant predictors did not change much. However, the AIC, which accounts for model complexity, slightly favors the full model (19338.98 vs. 19372.91 for the reduced one); given the negligible difference in predictive performance, I will keep the simpler reduced model.
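For reference, AIC trades goodness of fit against the number of estimated parameters: AIC = 2k − 2·log L. A quick check of that identity on a small built-in model (not the personality data):

```r
# AIC = 2k - 2*logLik, checked against R's AIC() on a small built-in model
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")
k <- attr(logLik(fit), "df")          # number of estimated parameters
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit))
```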

Findings

final_trait_lrm <- trait_lrm_2

# odds ratio
exp(final_trait_lrm$coefficients)
##  (Intercept)      Defiant     Trusting    Depressed         Bold   Empathetic 
##    0.8227362    1.3834987    1.1764748    1.1570323    0.6510598    0.8245394 
##    Scattered Intellectual   Analytical        Funny       Unique  Judgemental 
##    0.5068047    0.7003969    0.7945459    0.6175046    0.9576560    1.0469083 
##  Sentimental      Present        Quiet     Insecure  Introverted 
##    0.9121419    1.3796611    0.3191599    1.4446720    1.4918170
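An odds ratio above 1 means higher scores on that trait raise the odds of the class coded 1 (male); below 1, they lower them. A toy check that exp(beta) is the multiplicative change in odds per one-unit increase in a predictor (built-in data, not the personality data):

```r
# odds ratio = exp(beta): multiplicative change in odds per 1-unit increase
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")
odds_ratio <- unname(exp(coef(fit)["mpg"]))
# verify: the ratio of odds at mpg = 21 vs mpg = 20 equals exp(beta)
p <- function(x) predict(fit, newdata = data.frame(mpg = x), type = "response")
odds <- function(x) p(x) / (1 - p(x))
ratio_check <- unname(odds(21) / odds(20))
```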

Strongest predictors of men (odds ratio > 1.3): introverted (1.49), insecure (1.44), defiant (1.38), and present (1.38).

Strongest predictors of women (odds ratio < 1): quiet (0.32), scattered (0.51), funny (0.62), bold (0.65), and intellectual (0.70).

Summary

The PCA found 22 components from the 163 questions on the personality test: social, simple-minded, depressed, empathetic, scattered, unique, open, judgmental, introverted, intellectual, trusting, bold, defiant, insecure, moody, funny, serious, analytical, quiet, sentimental, traditional, and present.

Logistic regression was performed both with and without LASSO regularization, and the unregularized logistic regression performed slightly better. I then removed the insignificant traits: social, open, simple-minded, serious, traditional, and moody. These traits had little to no relationship with the respondent’s gender.

This final model (composed of depressed, empathetic, scattered, unique, judgmental, introverted, intellectual, trusting, bold, defiant, insecure, funny, analytical, quiet, sentimental, and present) explained about 32% of the variability in the data with an RMSE of 1.57. The model was roughly 75% accurate on both the training and testing data, and it performed better predicting women than men.

The model’s strongest predictors (odds ratio > 1.3) of men were high scores in introverted, insecure, present, and defiant. The model’s strongest predictors of women were high scores in quiet, scattered, funny, bold, and intellectual.

Research Question

What factors can be determined from the answers to the 163 questions?

Social, simple-minded, depressed, empathetic, scattered, unique, open, judgmental, introverted, intellectual, trusting, bold, defiant, insecure, moody, funny, serious, analytical, quiet, sentimental, traditional, and present.

Will my analysis produce the same 16 that Cattell’s is based on?

No, my analysis produced 22 components rather than the original 16.

What factors best predict gender?

Depressed, empathetic, scattered, unique, judgmental, introverted, intellectual, trusting, bold, defiant, insecure, funny, analytical, quiet, sentimental, and present.

Do personality scores differ enough across gender to predict accurately?

The modest r^2 value (.32) and high RMSE (1.57) show that the model does not perform extremely well on the data. However, the model predicted with about 75% accuracy on testing data, which shows that it performs well above chance and that gender does have some effect on scores.

Implications

My findings are interesting, as the predictors did not always point toward the gender I was expecting. For example, I would not have expected bold as a female trait or insecure as a male one. I find this interesting because people’s own feelings and responses do not necessarily conform to gender stereotypes. The low r^2 value and high RMSE also show that gender is difficult to predict from responses to personality questions, which suggests there are no major differences between the genders.

Limitations

The LASSO regularization did not add much, likely because of the large sample relative to the small number of predictors. If I had performed the regularization before feature selection and variable removal, it may have had a different outcome.

The data consists of responses from humans, who are unpredictable. Their answers may not accurately reflect their actual traits but rather their self-perception of those traits.

Also, the questions in the dataset were aimed at specific traits, which my PCA did not pick out exactly. Some components were similar, but the PCA split the questions in different ways. Therefore, the number of questions included in each factor was not equal: for example, “present” had only one, while “unique” had ten.

Future Directions

I would be interested in looking at prediction using Cartell’s original 16 factors. I also would be interested in looking at another variable in the data, such as age or country, and seeing if they were impacted by the factors I found.